Red Hat Enterprise Linux 7 Troubleshooting

Basic Troubleshooting Techniques and Procedures

Module Topics

Gathering Information
Problem Replication
Comparing, Isolating, and Testing

Gathering Information

Interviewing the Reporter

Avoid asking questions in an accusatory fashion.
- Instead of focusing on blame, try to gain the trust of the reporter.
- Put the reporter at ease.
- Focus on the reporter’s observation skills by rephrasing the question:
  "Have you noticed any recent changes to your system or environment?"
Remember that the reporter does not have your experience.
- The reporter is less likely to notice significant clues or warning signs.
- The reporter may not perceive changes as being related to the current problem.
Use every problem report as an opportunity to teach the reporter how to gather more information.
- Train the reporter to write down the exact error message text.
- Suggest that the reporter keep a log of changes he or she makes to the system.

In this section you will learn to gather sufficient information from the person who reported the problem.

One of the greatest challenges to troubleshooting any problem is gathering sufficient information from the reporter.

A standard question to ask the person who reported a problem is "What have you changed recently?" The challenge is that the reflex reply from the reporter is "Nothing." There are several reasons for this response:

The question is often asked in an accusatory fashion. Instead of focusing on blame, try to gain the trust of the reporter and put him or her at ease. Focus on the reporter’s observation skills by perhaps phrasing it as: "Have you noticed any recent changes to your system or environment?"
Remember that the reporter does not have your experience. He or she is less likely to notice significant clues or warning signs and may not perceive changes as being related to the current problem. For example, could the recent installation of a new wireless phone system in the building be causing sporadic networking issues?
Use every problem report as an opportunity to teach the reporter how to gather more information. For example, how many reporters write down the exact error message text? In this discussion of changes, suggest that the reporter keep a log of changes he or she makes to the system for easier recall.

Gathering Information

Review Available Logs

Read log files first!
- /var/log/messages is a good place to start.
- ls -lt /var/log/ lists log files by modification time, with the newest at the top.
- Use the -r flag to reverse the order, i.e., newest at the bottom.
  [root@server1 ~]# ls -ltr /var/log
Use dmesg to read the contents of the kernel’s log buffer.
- The buffer has a finite size, so the oldest messages are dropped once the buffer fills.
- On x86/x86-64, the buffer is 512 KB in Red Hat Enterprise Linux 7.
- Kernel messages are generally logged to /var/log/messages.
- During boot, a snapshot of the log buffer is saved to /var/log/dmesg (by the rhel-dmesg service on Red Hat Enterprise Linux 7; earlier releases did this near the end of the rc.sysinit script).
- Check the kernel’s ring buffer, or change rsyslog to log kernel messages to a file; kernel oops are often visible only from the kernel’s ring buffer or directly on the console.
- A kernel oops typically occurs when part of the kernel fails; because the kernel is modular, this may eject the affected module (and its dependencies) from the stack.
  [root@server1 ~]# less /var/log/dmesg
  [root@server1 ~]# dmesg | less
Some log files contain potentially useful information but can be hard to interpret, such as /var/log/audit/audit.log.
- Use tools such as ausearch to analyze audit logs and search for specific events.
  Example: Find all audit messages of type AVC (access vector cache) with ausearch -m AVC
  [root@desktop1 ~]# less /var/log/audit/audit.log
  [root@desktop1 ~]# ausearch -m AVC     # SELinux messages
  [root@desktop1 ~]# ausearch -m LOGIN   # Login messages

In this section you will learn about the following topics:

Extracting needed information from logs
Tuning applications to produce more output

Read log files first! The /var/log/messages file is usually a good place to start. Although not all applications and subsystems log directly into the messages file, most applications and subsystems do generate some type of notification. If no information is found in /var/log/messages and you are not sure which logs are being written, then the command ls -lt /var/log/ will list the log files by modification time, with the newest at the top. Use the -r flag to reverse the ordering, i.e., newest at the bottom.
The contents of the kernel’s log buffer can be read by using the dmesg command. Due to the finite size of the buffer, the oldest log messages are dropped once the buffer fills up. (On x86/x86-64, the size of this buffer is 512 KB in Red Hat Enterprise Linux 7.) During the boot process, a snapshot of the current contents of the log buffer is saved to /var/log/dmesg (by the rhel-dmesg service on Red Hat Enterprise Linux 7; earlier releases did this near the end of the rc.sysinit script). Otherwise, kernel messages are generally logged to /var/log/messages. Checking the kernel’s ring buffer, or changing rsyslog to log kernel messages to a file, is worthwhile because a kernel oops is often visible only in the kernel’s ring buffer or directly on the console (which may be seen infrequently on a remotely managed system). A kernel oops typically occurs when part of the kernel fails. Because the kernel is modular, this may result in the ejection of the affected module (and its dependencies) from the stack. Depending on which module or stack was affected, an oops may or may not lead to a kernel panic or have an obvious effect on the running system.
Additionally, some log files contain potentially useful information but can be hard to interpret, such as /var/log/audit/audit.log. There may be additional tools that can help analyze these logs, such as ausearch, for searching the audit log for specific events. For example, the following command would find all audit messages of type AVC (access vector cache): ausearch -m AVC.
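
The ls -lt approach described above can be wrapped in a small helper for repeated use. This is a minimal sketch rather than part of the course material; the function name recent_logs is hypothetical, and it assumes GNU coreutils as shipped with Red Hat Enterprise Linux.

```shell
# recent_logs DIR [COUNT]
# Print the COUNT most recently modified entries in DIR (default: 5),
# newest first. Hypothetical helper name, for illustration only.
recent_logs() {
    local dir=${1:?usage: recent_logs DIR [COUNT]}
    local count=${2:-5}
    # -t sorts by modification time, newest first; -1 prints one name per line.
    ls -1t -- "$dir" | head -n "$count"
}
```

Running recent_logs /var/log 5 as root shows which logs were written most recently, which is often a reasonable hint about where to look next.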

Gathering Information

Parsing Logs

Log files can contain many entries that are irrelevant for troubleshooting.
- Use grep -v 'ARG' /var/log/messages to remove irrelevant entries.
- To filter out several kinds of messages without chaining multiple grep commands, use extended regular expressions:
  [root@server1 ~]# grep -Ev 'ARG1|ARG2|ARG3' /var/log/messages
grep displays lines in a file that match a pattern.
- It can also process standard input when the filename argument is omitted.
- Patterns may contain regular expression metacharacters.
- It is good practice to quote regular expressions.
  Example: Using grep to search for a pattern in a log file
  [root@server1 ~]# grep 'sudo' /var/log/secure
  Feb 14 08:41:03 host sudo: ghacker : TTY=pts/1 ; PWD=/home/ghacker ; USER=root ; COMMAND=/bin/bash -l
Output from ps lists running processes.
- To list only lines that contain the string "init", run the following:
  [student@server1 ~]$ ps ax | grep 'init'
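
The grep -Ev technique above can be sketched as a reusable function that joins any number of patterns into one extended regular expression. The name filter_log is hypothetical, and the sample patterns in the usage note are placeholders, not messages taken from a real system.

```shell
# filter_log FILE [PATTERN...]
# Print the lines of FILE that match none of the given patterns.
# Hypothetical helper name, for illustration only.
filter_log() {
    local file=${1:?usage: filter_log FILE [PATTERN...]}
    shift
    # With no patterns there is nothing to exclude, so print the whole file.
    [ "$#" -gt 0 ] || { cat -- "$file"; return; }
    # Join the patterns with "|" to build one extended regular expression.
    local IFS='|'
    grep -Ev -- "$*" "$file"
}
```

For example, filter_log /var/log/messages crond dhclient would hide lines mentioning either of those two strings and show everything else.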

Common grep options include:

Option	Function
`-i`	Perform a case-insensitive search
`-v`	Exclude lines that contain the pattern
`-c`	Display a count of lines with the matching pattern
`-l`	Only list file names, do not display the matched lines
`-n`	Precede matched line with line numbers
`--color`	Highlight the matched string
`-A`, `-B`	When followed by a number, these options print that many lines after or before each match. This is useful for seeing the context in which a match appears within a file.
`-r`	Perform a recursive search of files, starting with the named directory
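
Several of these options combine naturally. The following sketch uses -c, -i, and -n together to summarize matches in a file; the function name match_report is hypothetical.

```shell
# match_report FILE PATTERN
# Print the number of lines in FILE that match PATTERN (ignoring case),
# followed by each matching line prefixed with its line number.
# Hypothetical helper name, for illustration only.
match_report() {
    local file=${1:?usage: match_report FILE PATTERN}
    local pattern=${2:?usage: match_report FILE PATTERN}
    # -c counts matching lines; -i ignores case.
    echo "matches: $(grep -ci -- "$pattern" "$file")"
    # -n prefixes each matching line with its line number.
    grep -ni -- "$pattern" "$file"
}
```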

To change the color that --color uses (red by default), use the GREP_COLOR variable:
```
[student@server1 ~]$ export GREP_COLOR='01;34'
```

Use head and tail wisely to increase your system administration capabilities.

To see logfiles scroll in real time, run tail -f logfile in one terminal, while running commands in another terminal.

Use command line flags when scripting and parsing output.

Example: Output of ps aux

[student@server1 ~]$ ps aux
USER   PID %CPU %MEM    VSZ   RSS TTY    STAT START   TIME COMMAND
root     1  0.0  0.0   2112   632 ?      Ss   Apr16   0:05 init [5]
root     2  0.0  0.0      0     0 ?      S<   Apr16   0:00 [kthread]
root     3  0.0  0.0      0     0 ?      S<   Apr16   0:04 [migration/0]
...

Specify that you want to start tail at a specific line number, such as 2:

[student@server1 ~]$ ps aux | tail -n +2
root     1  0.0  0.0   2112   632 ?      Ss   Apr16   0:05 init [5]
root     2  0.0  0.0      0     0 ?      S<   Apr16   0:00 [kthread]
root     3  0.0  0.0      0     0 ?      S<   Apr16   0:04 [migration/0]
...

References

dir_colors(5) man page
/etc/DIR_COLORS

Common grep options are shown in this table.
To change the color that --color uses (red by default), set the GREP_COLOR variable, for example: export GREP_COLOR='01;34'.

Read the dir_colors(5) man page, or look at /etc/DIR_COLORS for more information and examples.
Using head and tail wisely can greatly increase your system administration capabilities. By running tail -f logfile in one terminal while you run commands in another, you can watch your log files scroll by in real time.
You can also gain some great benefits from using command line flags when scripting and parsing output. Take, for example, the output of ps aux.

All the columns are there, but the header row makes it hard to parse this output automatically with another tool. Using tail, you can drop this first line by indicating that you want to start at line number 2: ps aux | tail -n +2. Note that with tail, the behavior changes when you put a + sign in front of the number of lines: output starts at that line instead of showing only the last lines. Conversely, with head you must specify a negative number to change the behavior: head -n -2 prints everything except the last two lines.
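
The two behaviors can be sketched as a pair of filters. The names skip_header and skip_footer are hypothetical, and the negative count for head assumes GNU coreutils.

```shell
# skip_header: copy standard input to standard output without its first line.
skip_header() {
    # With a leading "+", tail starts its output AT the given line number.
    tail -n +2
}

# skip_footer: copy standard input to standard output without its last line.
skip_footer() {
    # With a negative count, GNU head prints all but the last N lines.
    head -n -1
}
```

For example, ps aux | skip_header | wc -l counts the process lines without counting the column headings.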

Gathering Information

Increasing Verbosity in Logs

For some applications you can increase verbosity in the application’s configuration file.
Example: In the CUPS printing system, change LogLevel Info to LogLevel Debug or LogLevel Debug2 in /etc/cups/cupsd.conf to increase verbosity.
```
[root@server1 ~]# grep LogLevel /etc/cups/cupsd.conf
```
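
Changing the directive can also be scripted. The following is a minimal sketch, assuming the file contains a single uncommented LogLevel line; the function name set_loglevel is hypothetical, and the service normally has to be restarted before a new level takes effect.

```shell
# set_loglevel FILE LEVEL
# Rewrite the LogLevel directive in FILE to LEVEL, keeping the original
# as FILE.bak. Hypothetical helper name, for illustration only.
set_loglevel() {
    local file=${1:?usage: set_loglevel FILE LEVEL}
    local level=${2:?usage: set_loglevel FILE LEVEL}
    # Replace the argument of any line that starts with "LogLevel".
    sed -i.bak "s/^LogLevel[[:space:]][[:space:]]*.*/LogLevel ${level}/" "$file"
}
```

For example, set_loglevel /etc/cups/cupsd.conf Debug switches CUPS to debug logging and leaves the previous configuration in /etc/cups/cupsd.conf.bak so that it can be restored once troubleshooting is finished.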
Some command line tools have an option to increase verbosity.
- This is often the -v flag.
- Some tools accept multiple flags to steadily increase debugging output.
  Example: tcpdump accepts -v or -vv or even -vvv to increase debug output.
- If you are not sure which flag to pass, use -h or --help for usage and help output.
  [root@server1 ~]# lspci -vvv | less

References
- grep(1), dir_colors(5), head(1) and tail(1) man pages

Problem Replication

A key to ensuring success at resolving problems is being able to replicate the problem at will.
Being able to repeat the problem gives you a greater understanding of the problem triggers.
You may see significant clues or symptoms that the reporter missed.
You can use focused monitoring tools to expose hidden information.
If you cannot replicate the problem, how do you truly know that the problem was fixed?
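
For intermittent problems, one way to attempt replication is to run the suspect command repeatedly until it fails. This is only a sketch under that assumption; the function name first_failure is hypothetical, and it treats any non-zero exit status as a failure.

```shell
# first_failure MAX COMMAND [ARG...]
# Run COMMAND up to MAX times. Print the attempt number of the first
# failure and return 1; return 0 if every attempt succeeds.
# Hypothetical helper name, for illustration only.
first_failure() {
    local max=${1:?usage: first_failure MAX COMMAND [ARG...]}
    shift
    local attempt=1
    while [ "$attempt" -le "$max" ]; do
        # The command's own output is discarded; only its exit status matters.
        if ! "$@" >/dev/null 2>&1; then
            echo "failed on attempt $attempt"
            return 1
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max attempts"
}
```

Once a fix is in place, running the same loop again gives some evidence, though not proof, that the problem no longer occurs.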

Comparing, Isolating, and Testing

Comparing

Compare a broken system with a similar system that is functioning properly.
- Choose systems that are as close to identical as possible including hardware configuration, OS configuration, and installed applications.
- Do not adjust the healthy system.
Compare system logs and program output between the two systems.
Compare configuration files.
- Make copies of /etc from both systems and run a recursive diff.
Use performance monitoring tools to compare machine performance.
- sar reports metrics collected over a long period of time, which can then be compared.
- top shows processes that are currently active.
Compare two machines with the same software installed and congruent configurations, but with a different hardware platform.

In this section you will learn about the following topics:

Comparing performance or configuration differences with other systems or the baseline
Performing root-cause analysis
Verifying the fix to an isolated problem

It is very useful to compare a broken system with a similar system that is functioning properly. The systems to be compared should be as close to identical as possible. This includes hardware configuration, operating system configuration, and installed applications. Do not make any adjustments to the healthy system, otherwise you may have two systems to troubleshoot instead of one.

Compare system logs and program output between the two systems. Sometimes this helps locate relevant error messages more quickly. Once you identify the errors, you can correct the problem and then test the fix.

Another point of comparison is configuration files. Make copies of the /etc directory from both systems and run a recursive diff to identify differences. This process is a very efficient way to pinpoint configuration information that needs more detailed examination.

Performance monitoring tools, such as sar and top, allow you to compare machine performance between two systems. sar allows for comparison of metrics taken over a long period of time, and top allows you to compare processes that are currently active.

Finally, another possibility is to compare two machines that have the same software installed and congruent configurations, but have a different hardware platform. This identifies problems where the hardware is overextended and is running software beyond its capacity.
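
The recursive diff of two /etc copies mentioned above can be sketched as follows. The function name config_diff and the directory names in the usage note are hypothetical.

```shell
# config_diff DIR_A DIR_B
# List files that differ between two directory trees, or that exist in
# only one of them. Hypothetical helper name, for illustration only.
config_diff() {
    local a=${1:?usage: config_diff DIR_A DIR_B}
    local b=${2:?usage: config_diff DIR_A DIR_B}
    local status=0
    # -r recurses into subdirectories; -q reports only which files differ.
    LC_ALL=C diff -rq -- "$a" "$b" || status=$?
    # diff exits with 1 when it finds differences, which is not an error here;
    # a status of 2 or more indicates real trouble, such as a missing directory.
    [ "$status" -le 1 ]
}
```

For example, after copying /etc from each machine into /tmp/etc-good and /tmp/etc-broken, config_diff /tmp/etc-good /tmp/etc-broken lists the files worth a closer look; run a plain diff on each reported file to see the actual changes.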

Comparing, Isolating, and Testing

Isolating

Avoid taking steps based on the first identified cause.
- This can lead to a solution that does not address the root cause of the problem.
Brainstorm as many causes as possible that might produce the problem’s symptoms.
- The goal is to solve the root cause of a problem, not simply to address symptoms.
- Rank all the possible causes by difficulty to troubleshoot and solve.
- Start with the easiest causes and work down to the most complex causes.

A common troubleshooting mistake is to take the first identified cause of a problem and immediately take steps to address that problem. This can lead to a solution that does not address the root cause of the problem.

A better approach is to brainstorm as many causes as possible that might produce the symptoms that your broken system displays. The goal of troubleshooting is to solve the root cause of a problem rather than simply addressing symptoms.

After you identify all the possible causes, rank them by difficulty to troubleshoot and solve. Start with the easiest causes first, and then work down the list to the causes that are most complex to solve. This approach to troubleshooting is more efficient because it eliminates the causes that are easiest to diagnose first, rather than spending a lot of time troubleshooting a complex cause that may not be relevant.

Comparing, Isolating, and Testing

Testing

Make one change at a time.
Test the impact of each change.
Making more than one change at a time introduces additional variables, which makes it harder to tell which change had an effect.
